Computational Training: running fastR

Andrea Brizzi and Efthymios Costa

2024-05-28

Great R resources

In general:

  1. Advanced R
  2. R for Data Science
  3. RStudio Cheatsheets
  4. Useful R packages

Covering this session’s topics

  1. data.table
  2. parallel

What is “fast code”?

  1. ✏️ Easy to write
    familiarity with code editor, libraries

  2. 💡 Easy to understand
    structured, with consistent variable names, commented.

  3. 🔧 Easy to debug
    clear naming, DRY, tests.

  4. 🏎️ Easy to run
    profiling, C++, using “optimized” code.

Good coding practices

Variable naming

cat("cat")
cat <- "cat"
cat(cat)

names should be consistent, descriptive, lower case, readable.

For which snippet is it easier to guess the context?

tmp <-  10
tmp1 <- tmp * 24
cases_per_hour <- 10
cases_per_day <- cases_per_hour * 24

Embrace functional programming I

Functions are first-class citizens in R

  • can be passed as arguments
  • can be returned from other functions
  • can be assigned to variables
  • and more…
first-class-citizenship.R
f <- function(x){x^2}

lapply(1:10, f)            # passed as an argument

generator <- function(n=2){
    function(x){x^n}       # returned from another function
}
cube <- generator(3)
cube(2)                    # 8: assigned to a variable, then called

list(one_function = f)     # stored inside a data structure

Embrace functional programming II

Rethink for and while loops; use the “apply” family instead

“To become significantly more reliable, code must become more transparent. In particular, nested conditions and loops must be viewed with great suspicion. Complicated control flows confuse programmers. Messy code often hides bugs.”

— Bjarne Stroustrup

…but why?

Say you want to extract the \(R^2\) from three linear models with different predictors (or formulae).

formulae.R
formulae <- c(
    Sepal.Length ~ Sepal.Width,
    Sepal.Length ~ Petal.Length,
    Sepal.Length ~ Species
)
bad way
lm_results2 <- c()

for (formula in formulae) {
    fit <- lm(formula, data = iris)
    r2 <- summary(fit)$r.squared
    lm_results2 <- c(lm_results2, r2)
}
good way
extract_r2 <- function(formula) {
    fit <- lm(formula, data = iris)
    r2 <- summary(fit)$r.squared
    return(r2)
}

lm_results <- sapply(formulae, extract_r2)

What’s the difference?

side effects
exists("fit")

After the for loop, exists("fit") returns TRUE: fit, r2 and formula have leaked into the global environment. With sapply, those variables stay local to extract_r2.
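These side effects can be demonstrated directly; the variable names below (sq, sq_local) are illustrative, not from the slides:

```r
# Loop version: working variables escape into the calling environment.
results_loop <- c()
for (val in 1:3) {
  sq <- val^2                    # 'sq' and 'val' survive the loop
  results_loop <- c(results_loop, sq)
}
exists("sq")                     # TRUE

# Functional version: everything stays inside the function.
square <- function(x) {
  sq_local <- x^2                # local to each call
  sq_local
}
results_fun <- sapply(1:3, square)
exists("sq_local")               # FALSE
```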

Reproducibility & Generalisability

Why are code reproducibility & generalisability important?

  • ✅ Transparency & Verification
  • 🤝 Collaboration & Longevity
  • 🔍 Quicker detection of errors

Reproducibility

  • Main idea: Be able to reproduce results to ensure they are valid.
  • Common practices:
    • Setting the seed
    • Importing all necessary packages
    • Documenting R environment (sessionInfo())
bad way
x <- rnorm(1)
dtruncnorm(x, -5, 5, 0, 1)
still bad...
set.seed(1234)
x <- rnorm(1)
dtruncnorm(x, -5, 5, 0, 1)
there we go!
library(truncnorm)
set.seed(1234)
x <- rnorm(1)
dtruncnorm(x, -5, 5, 0, 1)

Generalisability

  • Code should be “generalisable”, meaning that someone else can pick it up and apply it to their own data.
  • Common practices:
    • ⚙️ Write functions for operations you use frequently
    • 💬 Document your code (add comments)
    • ❓ Add basic checks (testing)
    • ✔️❌ Test your code on different data sets
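As a minimal sketch of the “basic checks” point, stopifnot() turns an assumption into an immediate, readable error (the helper name safe_reciprocal is hypothetical):

```r
# Fail fast when the input breaks an assumption, instead of
# producing Inf or a cryptic error further downstream.
safe_reciprocal <- function(x) {
  stopifnot(is.numeric(x), all(x != 0))
  1 / x
}

safe_reciprocal(c(2, 4))   # 0.50 0.25
```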

Generalisability: Example

  • Task: Create a function kmeans_recip that does the following:
    • Takes a data set as input.
    • Computes reciprocals of variables (i.e. for each value \(x\) it computes \(1/x\)).
    • Performs K-Means clustering (function kmeans) on the resulting data set.
    • Returns the cluster assignment for the first 20 observations.
  • This function will be used on the iris data set (with K=3 clusters).

Generalisability: Bad example

don't even bother...
kmeans_recip <- function(){
  for (i in c(1:4)){
    iris[, i] <- 1/iris[, i]
  }
  kmeans_res <- kmeans(iris[, c(1:4)], centers = 3)
  return(kmeans_res$cluster)
}
kmeans_recip()[1:20]
 [1] 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1

Generalisability: OK example

  • The previous example works but it’s really not generalisable…
  • Obvious things that need improvement:
    • Hardcoded values (first 4 columns are numeric in iris)
    • Data set should be an input argument
it's getting better
kmeans_recip <- function(data, cont_cols){
  for (i in cont_cols){
    data[, i] <- 1/data[, i]
  }
  kmeans_res <- kmeans(data[, cont_cols], centers = 3)
  return(kmeans_res$cluster)
}
kmeans_recip(data = iris, cont_cols = c(1:4))[1:20]
 [1] 3 3 3 3 3 1 1 3 3 3 3 3 3 3 3 1 1 1 1 1

Generalisability: Good example

  • We can still do better! More things to consider:
    • Division by 0 is not allowed
    • Determine numerical variables automatically
    • Ensure there exists at least one continuous variable
    • Add some comments
that's looking good
kmeans_recip <- function(data){
  # Obtain numerical variables
  cont_cols <- which(sapply(data, is.numeric))
  # Check there is at least 1 numerical variable
  if (length(cont_cols)==0) stop('No numerical variables!')
  for (i in cont_cols){
    # Check if numerical variable takes 0 value
    ifelse(any(data[, i] == 0),
           stop('Division by 0 not allowed!'),
           data[, i] <- 1/data[, i])
  }
  # Apply K-Means clustering
  kmeans_res <- kmeans(data[, cont_cols], centers = 3)
  return(kmeans_res$cluster)
}
kmeans_recip(data = iris)[1:20]
 [1] 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1 1 1 1 1

Generalisability: Can we generalise?

  • Now we can also check if this works on the diamonds data set from the ggplot2 package.
seems like it works!
library(ggplot2)
kmeans_recip(data = diamonds)[1:20]
Error in ifelse(any(data[, i] == 0), stop("Division by 0 not allowed!"), : Division by 0 not allowed!

Debugging

  • 🐛 Debugging is the process of finding and fixing errors in your code.
  • 🤬 Generally annoying but can be made much easier with a few simple steps!

Debugging made easy

  • Common practices to debug your code more easily include:
    • 📖 Read error messages carefully
    • 🖨️ Use print messages (cat(), print(), message() etc.)
    • 💾 Save objects that may be causing the error (to be able to reproduce it)
    • 📑 A bit more “advanced” options are traceback(), browser(), debug() (more details here)
    • 💻 Don’t forget: Google, Stack Overflow, ChatGPT etc. are your friends
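A sketch combining two of these practices, print messages plus saving the offending objects (the function name and the file debug_input.rds are made up for illustration):

```r
# Print intermediate state, and snapshot the inputs when things go
# wrong, so the failure can be replayed later with readRDS().
divide_vectors <- function(x, y) {
  message("lengths: x = ", length(x), ", y = ", length(y))
  if (length(x) != length(y)) {
    saveRDS(list(x = x, y = y), "debug_input.rds")  # keep the failing input
    stop("x and y must have the same length")
  }
  x / y
}

divide_vectors(c(4, 9), c(2, 3))   # 2 3
```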

Using breakpoints for debugging

  • Breakpoints are markers you can set in your code for execution to stop.
incorrect function
calculate_sum <- function(x){
  total <- 0
  n <- length(x)
  for (i in 1:n) {
    total <- total + x[[i]]
  }
  return(total)
}

result <- calculate_sum(list(c(1), c(2), c(3), c('4')))
Error in total + x[[i]]: non-numeric argument to binary operator
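Stepping through the loop with a breakpoint (or browser()) would reveal that the fourth element is the character "4". One possible fix, assuming the inputs are meant to be numeric, is to validate each element before adding:

```r
# Same summation, but with an explicit check that pinpoints the
# offending element instead of failing with a generic message.
calculate_sum_safe <- function(x) {
  total <- 0
  for (i in seq_along(x)) {
    if (!is.numeric(x[[i]])) {
      stop("element ", i, " is not numeric: ", x[[i]])
    }
    total <- total + x[[i]]
  }
  total
}

calculate_sum_safe(list(1, 2, 3))  # 6
```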

Error Handling

  • 😰 Sometimes errors cannot be avoided…
  • 🎉 Good news is we can handle them!
  • 💊 Error handling is the process of responding to and recovering from error conditions.

Error Handling: the tryCatch() syntax

  • tryCatch() is the function to use for error handling in R.
basic tryCatch syntax
result <- tryCatch({
  expr
  },
  warning = function(w){
    warning-handler-code
    },
  error = function(e){
    error-handler-code
    },
  finally = {
    cleanup-code
  }
)

Error Handling: basic example

tryCatch example
safe_log <- function(x){
  result <- tryCatch({
    log(x) # Attempt to calculate the logarithm
  },
  warning = function(w){
    message("A warning occurred: ", w) # Handle warnings
    NULL # Return NULL if a warning occurs
  },
  error = function(e){
    message("An error occurred: ", e) # Handle the error
    NA  # Return NA if an error occurs
  },
  finally = {
    # This block executes no matter what
    message("Logarithm attempt completed.")
  })
  return(result)
}

Error Handling: basic example

tryCatch works!
ex_list <- list(c(-8), c("zero"), c(2024))
for (i in 1:length(ex_list)){
  print(safe_log(ex_list[[i]]))
}
A warning occurred: simpleWarning in log(x): NaNs produced
Logarithm attempt completed.
NULL
An error occurred: Error in log(x): non-numeric argument to mathematical function

Logarithm attempt completed.
[1] NA
Logarithm attempt completed.
[1] 7.612831

Code optimization

Throwback to {l,s}apply

We previously showed how you can avoid side effects by letting functions be the protagonists of your code.

extract_r2 <- function(formula) {
    fit <- lm(formula, data = iris)
    r2 <- summary(fit)$r.squared
    return(r2)
}

lm_results <- sapply(formulae, extract_r2)

What if each iteration in the loop is computationally expensive?

The parallel package

You can imagine wanting to run each of the apply/for loop iterations in parallel.

forking (not on Windows)
library(parallel)
f <- function(i) {
    lme4::lmer(
        Petal.Width ~ . - Species + (1 | Species),
        data = iris)
}

system.time(save1 <- lapply(1:100, f))
##    user  system elapsed
##   2.048   0.019   2.084
system.time(save2 <- mclapply(1:100, f))
##    user  system elapsed
##   1.295   0.150   1.471
sockets
num_cores <- detectCores()
cl <- makeCluster(num_cores)
system.time(save3 <- parLapply(cl, 1:100, f))
#    user  system elapsed 
#   0.198   0.044   1.032 
stopCluster(cl)
  • requires further attention: socket workers start as fresh R sessions, so variables and packages must be exported to them explicitly
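Concretely, socket workers do not see the main session’s packages or variables; they have to be shipped over with clusterEvalQ() and clusterExport(). A sketch (the variable offset is illustrative):

```r
library(parallel)

cl <- makeCluster(2)
clusterEvalQ(cl, library(stats))    # load packages on every worker
offset <- 10
clusterExport(cl, "offset")         # copy variables to every worker
res <- parLapply(cl, 1:4, function(i) i + offset)
stopCluster(cl)
unlist(res)                         # 11 12 13 14
```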

The Rcpp package

Now that tools like ChatGPT exist, it is easier than ever to write custom C++ functions to speed up your code.

  • Always test the output!
Rcpp::cppFunction("
double standard_normal_log_likelihood(NumericVector data) {
  int n = data.size();
  double log_likelihood = -0.5 * n * log(2 * M_PI);
  for (int i = 0; i < n; i++) {
    log_likelihood -= 0.5 * data[i] * data[i];
  }
  return log_likelihood;
}
")

x <- rnorm(10000000)
system.time(standard_normal_log_likelihood(x))
#    user  system elapsed 
#   0.021   0.000   0.022 
system.time(sum(log(dnorm(x))))
#    user  system elapsed 
#   0.284   0.016   0.300 
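Before reaching for C++, it is also worth checking for a vectorized base-R route: dnorm() has a log argument that computes log-densities directly, which avoids the underflow that log(dnorm(x)) can hit for extreme x (a sketch, not a full benchmark):

```r
# Base-R alternative: ask dnorm() for log-densities directly.
set.seed(1)
x <- rnorm(1e5)
ll_naive <- sum(log(dnorm(x)))        # log of possibly tiny densities
ll_log   <- sum(dnorm(x, log = TRUE)) # numerically stable
all.equal(ll_naive, ll_log)           # TRUE
```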

Introduction to data.table

  • data.table is a package that extends the data.frame class.
  • (Quicker) alternative to dplyr for large datasets.
  • cheatsheet

all you need to know

dt[i, j, by]

  • use the data.table called dt
  • subset it on the rows specified by i
  • and manipulate columns with j
  • grouped according to by.
library(data.table)
iris_dt <- as.data.table(iris)

iris summaries

Subsetting and Summarizing

dt[i, j, by]

  • Say we want to calculate the mean sepal length for the setosa species
iris_dt[Species == "setosa", mean(Sepal.Length)]
[1] 5.006

Grouping and Aggregating

dt[i, j, by]

  • Now we want to repeat the above for every species, in one command.
iris_dt[, mean(Sepal.Length), by=Species]
      Species    V1
1:     setosa 5.006
2: versicolor 5.936
3:  virginica 6.588
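A small extension of the same query: wrapping j in .() lets you name the aggregate yourself instead of getting the default V1 column (the name mean_sl is arbitrary):

```r
library(data.table)
iris_dt <- as.data.table(iris)
# Named aggregate: the result column is mean_sl rather than V1.
iris_dt[, .(mean_sl = mean(Sepal.Length)), by = Species]
```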

iris summaries (.N, .SD)

dt[i, j, by]

Counting entries (.N)

  • Count the total number of observations per species, with sepal length > 5.
iris_dt[Sepal.Length>5, .N, by=Species]
      Species  N
1:     setosa 22
2: versicolor 47
3:  virginica 49

Print 1st entry of each group (.SD)

dt[i, j, by]

  • Print the first entry of each group (species).
iris_dt[, .SD[1], by=Species]
      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
1:     setosa          5.1         3.5          1.4         0.2
2: versicolor          7.0         3.2          4.7         1.4
3:  virginica          6.3         3.3          6.0         2.5

Modifying the dt (in place)

dt[i, j, by]

You will need a walrus (operator):

:=

  • used in j
  • name := vector to act on a single column.
  • (names) := list of vectors to act on multiple columns.

Modifying the dt (in place)

Define a new column

  • Say you want species names to sound more Italian.
iris_dt[, Species_ita := paste0(
    Species, 'ino'
), by=Species]
iris_dt[, unique(Species_ita)]
[1] "setosaino"     "versicolorino" "virginicaino" 

Modify existing column

dt[i, j, by]

  • The “aino” ending does not sound right, let’s remove the “a”.
iris_dt[, Species_ita := gsub(
    "aino","ino",
    Species_ita
), by=Species_ita]
iris_dt[, unique(Species_ita)]
[1] "setosino"      "versicolorino" "virginicino"  

Acting on multiple columns

dt[i, j, by]

Apply a function to multiple columns

  • Let us round the numeric columns to integers…
numeric_cols <- names(iris_dt)[sapply(iris_dt, is.numeric)]
iris_dt[, lapply(
    .SD, as.integer
), .SDcols = numeric_cols] |> head(2)
   Sepal.Length Sepal.Width Petal.Length Petal.Width
1:            5           3            1           0
2:            4           3            1           0

Define multiple columns at once

  • … and store them in columns with the same name, but with a _int suffix.
new_cols <- paste0(numeric_cols, "_int")
iris_dt[, (new_cols) := lapply(
    .SD, as.integer
), .SDcols = numeric_cols]; head(iris_dt, n=2)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_ita
1:          5.1         3.5          1.4         0.2  setosa    setosino
2:          4.9         3.0          1.4         0.2  setosa    setosino
   Sepal.Length_int Sepal.Width_int Petal.Length_int Petal.Width_int
1:                5               3                1               0
2:                4               3                1               0

Summary and advanced topics

  • data.table is a powerful package for data manipulation.
  • quick to write and quick to run BUT not the easiest to read.


Advanced topics:

Profiling

  • Identify (and hopefully fix!) bottlenecks in your code.
  • The profvis package is a good package to use for this purpose.

Profiling Example: Column Means

library(profvis)
library(data.table)
n <- 4e5
cols <- 150
data <- as.data.frame(x = matrix(rnorm(n * cols, mean = 5), ncol = cols))
data <- cbind(id = paste0("g", seq_len(n)), data)
dataDT <- as.data.table(data)
numeric_vars <- setdiff(names(data), "id")

profvis({
  means <- apply(data[, names(data) != "id"], 2, mean)
  means <- colMeans(data[, names(data) != "id"])
  means <- lapply(data[, names(data) != "id"], mean)
  means <- vapply(data[, names(data) != "id"], mean, numeric(1))
  means <- matrixStats::colMeans2(as.matrix(data[, names(data) != "id"]))
  means <- dataDT[, lapply(.SD, mean), .SDcols = numeric_vars]
})
})

Profiling Example: Column Means

Packages for all your needs